My Motivation:

Recently, I had a conversation with a barber about her home’s water quality. She complained that the water smelled like rotten eggs, so she had to buy bottled water for cooking and drinking. She lives in East Palo Alto, which is not far from the Stanford campus in Palo Alto. The drinking water supply for the Bay Area comes from the Hetch Hetchy Reservoir. However, water quality within the region can differ depending on how the drinking water is treated, potential chemical contamination, and other factors. An SF Gate news article stated that drinking water quality in the Bay Area meets the federal health guidelines, but that the concentrations of some toxic chemicals are higher than the levels suggested by scientific studies. The article also gave an example of a more than 100-fold difference in chromium concentration between Hayward and Daly City. In other words, drinking water quality can vary greatly even when the water comes from the same source.

Since current drinking water quality does not reflect past contamination, a dataset that factors in toxic chemical contamination and violations can help us better understand the differences among cities in the Bay Area. CalEnviroScreen provides a drinking water contaminant index that accounts for the contaminants present, their levels, and past violations in the drinking water. However, it does not measure a provider’s current compliance with regulations, nor does it reflect the most recent drinking water quality. My hypothesis is that regions with high-income residents have a lower drinking water contaminant index than low-income areas. To test it, I combine the CalEnviroScreen data with American Community Survey (ACS) data, which provides the income information. In addition to the income analysis, I also want to know whether minorities were more likely to experience a higher drinking water contaminant index, that is, worse drinking water quality, in the past.

The map above shows that the drinking water index is lower in the eastern part of the Bay Area than anywhere else in the region.

The leaflet() map above shows that the northern Bay Area and the San Jose area have a higher drinking water index. In other words, these regions had more violations or contaminants in the past.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   103.3   198.0   267.2   277.1   349.5   941.4       3

The equity analysis above shows that a disproportionate number of white households experience a high drinking water index. For index values from 600 to 1000, the proportion of white households is significantly higher than the overall proportion of white households. The result suggests that white households in the Bay Area experienced a disproportionate share of drinking water contamination between 2005 and 2013. However, white residents also make up a larger share of the Bay Area population than any other race. Note that the 1000-1100 bin contains only NAs.
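The comparison described above (the white share of households within each index bin versus the overall white share) can be sketched in base R as follows. This is only an illustration: the data frame `tracts`, its column names, and all of the counts are hypothetical toy values, not the ACS or CalEnviroScreen data used in the analysis.

```r
# Hypothetical toy counts for illustration only (NOT the real tract data).
tracts <- data.frame(
  water_index = c(150, 250, 320, 410, 650, 720),
  white_hh    = c(400, 300, 350, 200, 500, 450),
  other_hh    = c(600, 700, 450, 600, 250, 150)
)

# Assign each tract to a 100-point index bin, e.g. [600,700).
tracts$bin <- cut(tracts$water_index, breaks = seq(100, 800, by = 100), right = FALSE)

# White share of households within each bin (empty bins come back as NA).
white_by_bin <- tapply(tracts$white_hh, tracts$bin, sum)
total_by_bin <- tapply(tracts$white_hh + tracts$other_hh, tracts$bin, sum)
share_by_bin <- white_by_bin / total_by_bin

# Overall white share across all tracts, used as the reference line.
overall_share <- sum(tracts$white_hh) / sum(tracts$white_hh + tracts$other_hh)

print(round(share_by_bin, 3))
print(round(overall_share, 3))
```

A bin whose share sits well above `overall_share` is what the text calls disproportionate. Bins with no tracts show up as NA, which is the same kind of gap noted for the 1000-1100 range.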

Continue to Income vs. Water Quality Index

The equity analysis above does not show any inequity between household income and the drinking water index.

Survey Regression on Income vs. Drinking Water Index

## 
## Call:
## lm(formula = `Drinking Water` ~ per_over75k, data = bay_water_income_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -191.92  -80.55    0.73   52.79  545.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 245.1362    20.6484  11.872   <2e-16 ***
## per_over75k   1.4227     0.8567   1.661   0.0978 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121.6 on 314 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.008706,   Adjusted R-squared:  0.005549 
## F-statistic: 2.758 on 1 and 314 DF,  p-value: 0.09779

Based on the regression above, the drinking water index increases by 1.4227 for every one-percentage-point increase in the share of households with income greater than $75,000. However, the p-value for this coefficient is 0.0978, which is greater than 0.05, so the relationship is not statistically significant at the 5% level and the share of households with income over $75,000 is not a good predictor of the drinking water index. Also, the adjusted R-squared value is 0.005549, which means the variation in per_over75k explains only about 0.55% of the variation in the drinking water index.
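As a rough check on the income regression, the t value and p-value for the slope can be recomputed from the printed estimate and standard error. This is a minimal sketch in base R; the variable names are placeholders chosen here, and the inputs are the rounded values copied from the summary output, not the underlying data.

```r
# Values copied from the printed regression summary (rounded as shown there).
estimate  <- 1.4227   # slope for per_over75k
std_error <- 0.8567   # standard error of the slope
resid_df  <- 314      # residual degrees of freedom

# t statistic: how many standard errors the slope is away from zero.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# Implied change in the index for a 10-percentage-point rise in per_over75k.
change_per_10_points <- 10 * estimate

print(round(c(t = t_value, p = p_value, change = change_per_10_points), 4))
```

Because the inputs are rounded, the recomputed values should agree with the summary table only to about three decimal places.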

Now try to factor in Race

## 
## Call:
## lm(formula = `Drinking Water` ~ perc_white, data = bay_water_income_lm1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -230.59  -75.27    0.17   58.64  532.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 245.3549    14.7938  16.585   <2e-16 ***
## perc_white    1.4987     0.6126   2.446    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121 on 314 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.0187, Adjusted R-squared:  0.01558 
## F-statistic: 5.984 on 1 and 314 DF,  p-value: 0.01498

Based on the regression above, the drinking water index increases by 1.4987 for every one-percentage-point increase in the white share of the population in the Bay Area. The p-value for this coefficient is 0.015, which is less than 0.05, so the percentage of white population is a statistically significant predictor of the drinking water index. However, the adjusted R-squared value is 0.01558, which means the variation in perc_white explains only about 1.56% of the variation in the drinking water index.
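Another way to read the perc_white result is through a 95% confidence interval for the slope, which can be approximated from the printed estimate and standard error. The sketch below uses base R only; the variable names are illustrative, and the inputs are the rounded values from the summary output.

```r
# Values copied from the printed regression summary (rounded as shown there).
estimate  <- 1.4987   # slope for perc_white
std_error <- 0.6126   # standard error of the slope
resid_df  <- 314      # residual degrees of freedom

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = resid_df)

# Confidence interval: estimate plus or minus t_crit standard errors.
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")

print(round(ci, 3))
```

The interval runs from roughly 0.29 to 2.70. It excludes zero, which matches the p-value below 0.05, but it is wide, so the size of the effect is not pinned down precisely.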

## 
## Call:
## lm(formula = `Drinking Water` ~ perc_over75k + perc_white, data = bay_water_income_lm1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -230.28  -75.09    0.37   58.38  532.24 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  244.64491   20.57857  11.888   <2e-16 ***
## perc_over75k   0.05698    1.14605   0.050    0.960    
## perc_white     1.47139    0.82368   1.786    0.075 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121.2 on 313 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01871,    Adjusted R-squared:  0.01244 
## F-statistic: 2.984 on 2 and 313 DF,  p-value: 0.05203

With the addition of the percentage of white households, the regression produces p-values that are greater than 0.05 for both predictors. The analysis suggests that the drinking water index increases by 0.05698 for every one-percentage-point increase in the share of households with income over $75,000, and by 1.47139 for every one-percentage-point increase in the white share of the total population. Although the direction of the result agrees with the earlier equity analysis of race and the drinking water index, the p-values exceed 0.05 for both parameters. Therefore, we cannot say that there is a relationship between the drinking water index and perc_over75k or perc_white in this model. The adjusted R-squared value is 0.01244, which means that race and income together explain only about 1.2% of the variation in the drinking water index.
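The adjusted R-squared and the overall F-statistic of the two-predictor model follow directly from the multiple R-squared and the degrees of freedom. The base R sketch below recomputes them from the printed values; the variable names are illustrative, and the observation count is inferred here from the residual degrees of freedom.

```r
# Values copied from the printed summary of the two-predictor model.
r_squared    <- 0.01871   # multiple R-squared
n_predictors <- 2         # perc_over75k and perc_white
resid_df     <- 313       # residual degrees of freedom

# Observations used in the fit: residual df + predictors + intercept.
n_obs <- resid_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic and its p-value (tests both slopes jointly against zero).
f_stat    <- (r_squared / n_predictors) / ((1 - r_squared) / resid_df)
f_p_value <- pf(f_stat, n_predictors, resid_df, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r_squared, F = f_stat, p = f_p_value), 4))
```

The recomputed adjusted R-squared should land close to the 0.01244 in the output, and the joint F-test p-value close to 0.052, just above the 0.05 threshold.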

Before the data correction, there was a Simpson’s paradox. Simpson’s paradox happens when a trend appears in several groups but reverses when the groups are combined. In this analysis, there is no statistically significant relationship between the percentage of households with income greater than $75,000 and the drinking water index, while there is a significant relationship between the percentage of white households and the drinking water index. The slope is positive in both cases. When both variables are included in the same model, both slopes remain positive. As a result, there is no Simpson’s paradox in the above analysis.
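For readers unfamiliar with Simpson’s paradox, the base R sketch below builds a small synthetic example in which the trend within each group is negative but the pooled trend is positive. The data frame `toy` and its values are made up purely for illustration and have nothing to do with the Bay Area data.

```r
# Synthetic data for illustration only; these are not the Bay Area tract values.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = 1:10,
  y     = c(9, 8, 7, 6, 5, 14, 13, 12, 11, 10)
)

# Slope of y on x when both groups are pooled together.
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Slope of y on x fitted separately within each group.
group_slopes <- sapply(
  split(toy, toy$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

print(pooled_slope)   # positive: y rises with x overall
print(group_slopes)   # negative in both groups: y falls with x within A and within B
```

In the regressions above, the sign of each slope does not flip between the single-predictor and two-predictor models, which is why the text concludes that no such reversal is present.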

To check that there is no Simpson’s paradox, we can first look at the relationship between the percentage of households with income over $75,000 and the drinking water index, using color to show the percentage of white population. In the scatter plot above, darker and lighter points are evenly distributed across the chart.

The black dotted line in the above chart represents the overall best-fit line. The red line represents the trend for the lowest 25% of perc_white, whereas the purple line represents the trend for the top 25% of perc_white. The red and green lines show positive trends, and the other two lines show negative trends. This suggests that the relationship is not significant when accounting for both perc_over75k and perc_white.